{
 "metadata": {
  "name": "",
  "signature": "sha256:131a28c847311652b0351e192b99b85a0b013dad3c13e43232a227bc1622c463"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Tutorial\n",
      "\n",
      "Example of how to generate a working submission for https://www.kaggle.com/c/datasciencebowl\n",
      "based on the offical competition tutorial as it appears at https://www.kaggle.com/c/datasciencebowl/details/tutorial"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Import libraries for doing image analysis\n",
      "from skimage.io import imread\n",
      "from skimage.transform import resize\n",
      "from sklearn.ensemble import RandomForestClassifier as RF\n",
      "import glob\n",
      "import os\n",
      "from sklearn.cross_validation import StratifiedKFold as KFold\n",
      "from skimage import measure\n",
      "from skimage import morphology\n",
      "import numpy as np\n",
      "import pandas as pd"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import warnings\n",
      "warnings.filterwarnings(\"ignore\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Importing the Data\n",
      "\n",
      "The training data is organized in a series of subdirectories that contain examples for the each class of interest. We will store the list of directory names to aid in labelling the data classes for training and testing purposes."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get the classnames from the directory structure\n",
      "directory_names = list(set(glob.glob(os.path.join(\"competition_data\",\"train\", \"*\"))\\\n",
      " ).difference(set(glob.glob(os.path.join(\"competition_data\",\"train\",\"*.*\")))))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# find the largest nonzero region\n",
      "def getLargestRegion(props, labelmap, imagethres):\n",
      "    regionmaxprop = None\n",
      "    for regionprop in props:\n",
      "        # check to see if the region is at least 50% nonzero\n",
      "        if sum(imagethres[labelmap == regionprop.label])*1.0/regionprop.area < 0.50:\n",
      "            continue\n",
      "        if regionmaxprop is None:\n",
      "            regionmaxprop = regionprop\n",
      "        if regionmaxprop.filled_area < regionprop.filled_area:\n",
      "            regionmaxprop = regionprop\n",
      "    return regionmaxprop"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def getMinorMajorRatio(image):\n",
      "    image = image.copy()\n",
      "    # Create the thresholded image to eliminate some of the background\n",
      "    imagethr = np.where(image > np.mean(image),0.,1.0)\n",
      "\n",
      "    #Dilate the image\n",
      "    imdilated = morphology.dilation(imagethr, np.ones((4,4)))\n",
      "\n",
      "    # Create the label list\n",
      "    label_list = measure.label(imdilated)\n",
      "    label_list = imagethr*label_list\n",
      "    label_list = label_list.astype(int)\n",
      "    \n",
      "    region_list = measure.regionprops(label_list)\n",
      "    maxregion = getLargestRegion(region_list, label_list, imagethr)\n",
      "    \n",
      "    # guard against cases where the segmentation fails by providing zeros\n",
      "    ratio = 0.0\n",
      "    if ((not maxregion is None) and  (maxregion.major_axis_length != 0.0)):\n",
      "        ratio = 0.0 if maxregion is None else  maxregion.minor_axis_length*1.0 / maxregion.major_axis_length\n",
      "    return ratio"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Preparing Training Data\n",
      "\n",
      "With our code for the ratio of minor to major axis, let's add the raw pixel values to the list of features for our dataset. In order to use the pixel values in a model for our classifier, we need a fixed length feature vector, so we will rescale the images to be constant size and add the fixed number of pixels to the feature vector.\n",
      "\n",
      "To create the feature vectors, we will loop through each of the directories in our training data set and then loop over each image within that class. For each image, we will rescale it to 25 x 25 pixels and then add the rescaled pixel values to a feature vector, X. The last feature we include will be our width-to-length ratio. We will also create the class label in the vector y, which will have the true class label for each row of the feature vector, X."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Rescale the images and create the combined metrics and training labels\n",
      "\n",
      "#get the total training images\n",
      "numberofImages = 0\n",
      "for folder in directory_names:\n",
      "    for fileNameDir in os.walk(folder):   \n",
      "        for fileName in fileNameDir[2]:\n",
      "             # Only read in the images\n",
      "            if fileName[-4:] != \".jpg\":\n",
      "              continue\n",
      "            numberofImages += 1\n",
      "\n",
      "# We'll rescale the images to be 25x25\n",
      "maxPixel = 25\n",
      "imageSize = maxPixel * maxPixel\n",
      "num_rows = numberofImages # one row for each image in the training dataset\n",
      "num_features = imageSize + 1 # for our ratio\n",
      "\n",
      "# X is the feature vector with one row of features per image\n",
      "# consisting of the pixel values and our metric\n",
      "X = np.zeros((num_rows, num_features), dtype=float)\n",
      "# y is the numeric class label \n",
      "y = np.zeros((num_rows))\n",
      "\n",
      "files = []\n",
      "# Generate training data\n",
      "i = 0    \n",
      "label = 0\n",
      "# List of string of class names\n",
      "namesClasses = list()\n",
      "\n",
      "print \"Reading images\"\n",
      "# Navigate through the list of directories\n",
      "for folder in directory_names:\n",
      "    # Append the string class name for each class\n",
      "    currentClass = folder.split(os.pathsep)[-1]\n",
      "    namesClasses.append(currentClass)\n",
      "    for fileNameDir in os.walk(folder):   \n",
      "        for fileName in fileNameDir[2]:\n",
      "            # Only read in the images\n",
      "            if fileName[-4:] != \".jpg\":\n",
      "              continue\n",
      "            \n",
      "            # Read in the images and create the features\n",
      "            nameFileImage = \"{0}{1}{2}\".format(fileNameDir[0], os.sep, fileName)            \n",
      "            image = imread(nameFileImage, as_grey=True)\n",
      "            files.append(nameFileImage)\n",
      "            axisratio = getMinorMajorRatio(image)\n",
      "            image = resize(image, (maxPixel, maxPixel))\n",
      "            \n",
      "            # Store the rescaled image pixels and the axis ratio\n",
      "            X[i, 0:imageSize] = np.reshape(image, (1, imageSize))\n",
      "            X[i, imageSize] = axisratio\n",
      "            \n",
      "            # Store the classlabel\n",
      "            y[i] = label\n",
      "            i += 1\n",
      "            # report progress for each 5% done  \n",
      "            report = [int((j+1)*num_rows/20.) for j in range(20)]\n",
      "            if i in report: print np.ceil(i *100.0 / num_rows), \"% done\"\n",
      "    label += 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Reading images\n",
        "5.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "10.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "15.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "20.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "25.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "30.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "35.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "40.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "45.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "50.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "55.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "60.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "65.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "70.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "75.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "80.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "85.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "90.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "95.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "100.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "numberofImages"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "30336"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Random Forest Classification\n",
      "\n",
      "We choose a random forest model to classify the images. Random forests perform well in many classification tasks and have robust default settings. We will give a brief description of a random forest model so that you can understand its two main free parameters: n_estimators and max_features.\n",
      "\n",
      "A random forest model is an ensemble model of n_estimators number of decision trees. During the training process, each decision tree is grown automatically by making a series of conditional splits on the data. At each split in the decision tree, a random sample of max_features number of features is chosen and used to make a conditional decision on which of the two nodes that the data will be grouped in. The best condition for the split is determined by the split that maximizes the class purity of the nodes directly below. The tree continues to grow by making additional splits until the leaves are pure or the leaves have less than the minimum number of samples for a split (in sklearn default for min_samples_split is two data points). The final majority class purity of the terminal nodes of the decision tree are used for making predictions on what class a new data point will belong. Then, the aggregate vote across the forest determines the class prediction for new samples."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The competition scoring uses a multiclass log-loss metric to compute your overall score. In the next steps, we define the multiclass log-loss function and compute your estimated score on the training dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def multiclass_log_loss(y_true, y_pred, eps=1e-15):\n",
      "    \"\"\"Multi class version of Logarithmic Loss metric.\n",
      "    https://www.kaggle.com/wiki/MultiClassLogLoss\n",
      "\n",
      "    Parameters\n",
      "    ----------\n",
      "    y_true : array, shape = [n_samples]\n",
      "            true class, intergers in [0, n_classes - 1)\n",
      "    y_pred : array, shape = [n_samples, n_classes]\n",
      "\n",
      "    Returns\n",
      "    -------\n",
      "    loss : float\n",
      "    \"\"\"\n",
      "    predictions = np.clip(y_pred, eps, 1 - eps)\n",
      "\n",
      "    # normalize row sums to 1\n",
      "    predictions /= predictions.sum(axis=1)[:, np.newaxis]\n",
      "\n",
      "    actual = np.zeros(y_pred.shape)\n",
      "    n_samples = actual.shape[0]\n",
      "    actual[np.arange(n_samples), y_true.astype(int)] = 1\n",
      "    vectsum = np.sum(actual * np.log(predictions))\n",
      "    loss = -1.0 / n_samples * vectsum\n",
      "    return loss"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Get the probability predictions for computing the log-loss function\n",
      "kf = KFold(y, n_folds=5)\n",
      "# prediction probabilities number of samples, by number of classes\n",
      "y_pred = np.zeros((len(y),len(set(y))))\n",
      "for train, test in kf:\n",
      "    X_train, X_test, y_train, y_test = X[train,:], X[test,:], y[train], y[test]\n",
      "    clf = RF(n_estimators=100, n_jobs=3)\n",
      "    clf.fit(X_train, y_train)\n",
      "    y_pred[test] = clf.predict_proba(X_test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "multiclass_log_loss(y, y_pred)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "3.6572165879194749"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The multiclass log loss function is an classification error metric that heavily penalizes you for being both confident (either predicting very high or very low class probability) and wrong. Throughout the competition you will want to check that your model improvements are driving this loss metric lower."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Where to Go From Here\n",
      "\n",
      "Now that you've made a simple metric, created a model, and examined the model's performance on the training data, the next step is to make improvements to your model to make it more competitive. The random forest model we created does not perform evenly across all classes and in some cases fails completely. By creating new features and looking at some of your distributions for the problem classes directly, you can identify features that specifically help separate those classes from the others. You can add new metrics by considering other image properties, stratified sampling, transformations, or other models for the classification."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Submissions"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "header = \"acantharia_protist_big_center,acantharia_protist_halo,acantharia_protist,amphipods,appendicularian_fritillaridae,appendicularian_s_shape,appendicularian_slight_curve,appendicularian_straight,artifacts_edge,artifacts,chaetognath_non_sagitta,chaetognath_other,chaetognath_sagitta,chordate_type1,copepod_calanoid_eggs,copepod_calanoid_eucalanus,copepod_calanoid_flatheads,copepod_calanoid_frillyAntennae,copepod_calanoid_large_side_antennatucked,copepod_calanoid_large,copepod_calanoid_octomoms,copepod_calanoid_small_longantennae,copepod_calanoid,copepod_cyclopoid_copilia,copepod_cyclopoid_oithona_eggs,copepod_cyclopoid_oithona,copepod_other,crustacean_other,ctenophore_cestid,ctenophore_cydippid_no_tentacles,ctenophore_cydippid_tentacles,ctenophore_lobate,decapods,detritus_blob,detritus_filamentous,detritus_other,diatom_chain_string,diatom_chain_tube,echinoderm_larva_pluteus_brittlestar,echinoderm_larva_pluteus_early,echinoderm_larva_pluteus_typeC,echinoderm_larva_pluteus_urchin,echinoderm_larva_seastar_bipinnaria,echinoderm_larva_seastar_brachiolaria,echinoderm_seacucumber_auricularia_larva,echinopluteus,ephyra,euphausiids_young,euphausiids,fecal_pellet,fish_larvae_deep_body,fish_larvae_leptocephali,fish_larvae_medium_body,fish_larvae_myctophids,fish_larvae_thin_body,fish_larvae_very_thin_body,heteropod,hydromedusae_aglaura,hydromedusae_bell_and_tentacles,hydromedusae_h15,hydromedusae_haliscera_small_sideview,hydromedusae_haliscera,hydromedusae_liriope,hydromedusae_narco_dark,hydromedusae_narco_young,hydromedusae_narcomedusae,hydromedusae_other,hydromedusae_partial_dark,hydromedusae_shapeA_sideview_small,hydromedusae_shapeA,hydromedusae_shapeB,hydromedusae_sideview_big,hydromedusae_solmaris,hydromedusae_solmundella,hydromedusae_typeD_bell_and_tentacles,hydromedusae_typeD,hydromedusae_typeE,hydromedusae_typeF,invertebrate_larvae_other_A,invertebrate_larvae_other_B,jellies_tentacles,polychaete,protist_dark_center,protist_fuzzy_olive,protist_noctiluca,protist_other,protist_star,pteropod_butterfly,pteropod_theco_dev_seq,pteropod_triangle,radiolarian_chain,radiolarian_colony,shrimp_caridean,shrimp_sergestidae,shrimp_zoea,shrimp-like_other,siphonophore_calycophoran_abylidae,siphonophore_calycophoran_rocketship_adult,siphonophore_calycophoran_rocketship_young,siphonophore_calycophoran_sphaeronectes_stem,siphonophore_calycophoran_sphaeronectes_young,siphonophore_calycophoran_sphaeronectes,siphonophore_other_parts,siphonophore_partial,siphonophore_physonect_young,siphonophore_physonect,stomatopod,tornaria_acorn_worm_larvae,trichodesmium_bowtie,trichodesmium_multiple,trichodesmium_puff,trichodesmium_tuft,trochophore_larvae,tunicate_doliolid_nurse,tunicate_doliolid,tunicate_partial,tunicate_salp_chains,tunicate_salp,unknown_blobs_and_smudges,unknown_sticks,unknown_unclassified\".split(',')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "labels = map(lambda s: s.split('/')[-1], namesClasses)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#get the total test images\n",
      "fnames = glob.glob(os.path.join(\"competition_data\", \"test\", \"*.jpg\"))\n",
      "numberofTestImages = len(fnames)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "numberofTestImages"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "130400"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X_test = np.zeros((numberofTestImages, num_features), dtype=float)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "images = map(lambda fileName: fileName.split('/')[-1], fnames)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "i = 0\n",
      "# report progress for each 5% done  \n",
      "report = [int((j+1)*numberofTestImages/20.) for j in range(20)]\n",
      "for fileName in fnames:\n",
      "    # Read in the images and create the features\n",
      "    image = imread(fileName, as_grey=True)\n",
      "    axisratio = getMinorMajorRatio(image)\n",
      "    image = resize(image, (maxPixel, maxPixel))\n",
      "\n",
      "    # Store the rescaled image pixels and the axis ratio\n",
      "    X_test[i, 0:imageSize] = np.reshape(image, (1, imageSize))\n",
      "    X_test[i, imageSize] = axisratio\n",
      " \n",
      "    i += 1\n",
      "    if i in report: print np.ceil(i *100.0 / numberofTestImages), \"% done\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "5.0 % done\n",
        "10.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "15.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "20.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "25.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "30.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "35.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "40.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "45.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "50.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "55.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "60.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "65.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "70.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "75.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "80.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "85.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "90.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "95.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n",
        "100.0"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " % done\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y_pred = clf.predict_proba(X_test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y_pred.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "(130400, 121)"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.DataFrame(y_pred, columns=labels, index=images)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.index.name = 'image'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = df[header]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 23
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.to_csv('competition_data/submission.csv')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!gzip competition_data/submission.csv"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 25
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls -l competition_data/submission.csv.gz"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "-rw-r--r--  1 udi  staff  6262209 Dec 15 22:14 competition_data/submission.csv.gz\r\n"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "which gave 3.656809 (very close to the validation score ~3.7)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}